Three Kinds of Noise
Stable Diffusion generates images by denoising a random latent: for a 512x512 image, a 4-channel 64x64 tensor drawn from a Gaussian distribution. The text prompt guides the denoising, steering every location toward the same concept. But there are three separate spaces where you can inject spatial variation, and each controls different things.
Visual noise is the standard latent space noise. It determines texture, color placement, fine detail.
Concept noise operates in CLIP's 512-dimensional text embedding space. Instead of conditioning the entire image on a single prompt, I define spatial masks that assign different prompts to different regions: the left third is wheat field, the right third is ocean, the middle morphs between them. Each location gets its own 512-dimensional embedding vector, blended according to mask weights. The masks are 512x64x64 tensors.
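A minimal sketch of the blending step, under some loud assumptions: each prompt gets one scalar weight per location, the weights are normalized to sum to 1, and the result is the 512x64x64 map. The function names are hypothetical, and random vectors stand in for real CLIP embeddings.

```python
import numpy as np

def blend_concepts(embeddings, masks, eps=1e-8):
    """Blend K prompt embeddings into a per-location embedding map.

    embeddings: (K, D) array, one embedding per prompt.
    masks:      (K, H, W) array of non-negative weights, one mask per prompt.
    Returns a (D, H, W) array: at each location, the mean of the K embeddings
    weighted by the normalized mask values.
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)
    masks = np.asarray(masks, dtype=np.float32)
    # Normalize so the weights at each location sum to 1.
    weights = masks / (masks.sum(axis=0, keepdims=True) + eps)
    # out[d, h, w] = sum over k of weights[k, h, w] * embeddings[k, d]
    return np.einsum("khw,kd->dhw", weights, embeddings)

def left_right_masks(height=64, width=64):
    """Left third pinned to concept 0, right third to concept 1, linear ramp between."""
    x = np.arange(width, dtype=np.float32)
    lo, hi = width / 3.0, 2.0 * width / 3.0
    ramp = np.clip((x - lo) / (hi - lo), 0.0, 1.0)  # 0 on the left, 1 on the right
    right = np.broadcast_to(ramp, (height, width))
    return np.stack([1.0 - right, right])

rng = np.random.default_rng(0)
# Random stand-ins for the "wheat field" and "ocean" prompt embeddings (assumption).
wheat, ocean = rng.standard_normal((2, 512)).astype(np.float32)
concept_map = blend_concepts(np.stack([wheat, ocean]), left_right_masks())
print(concept_map.shape)  # (512, 64, 64)
```

The left third of the map is the wheat vector, the right third is the ocean vector, and the middle columns are linear mixes of the two. The shape of the ramp is a design choice, and a smoother falloff may well look better.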
Feature noise operates in the 768-dimensional cross-attention feature space, where text conditioning meets visual generation inside the U-Net. Perturbing features here affects higher-order properties: the style of edges, the feel of lighting, the texture vocabulary the model draws from. Feature masks are 768x64x64 tensors.
The Modulation Formula
The per-location embedding is:
location_embedding = base_embedding + mask_vector * 0.1 + noise
The 0.1 scaling factor was found empirically. Too high and spatial transitions become jarring, with hard edges between concepts. Too low and the masks have no visible effect. The useful range is narrow: the difference between 0.08 and 0.12 is the difference between "barely visible" and "broken," and the range shifts with every model checkpoint.
The relationship between mask weight and noise is inverse: high-weight regions get less noise (the concept is pinned), low-weight regions get more (the concept is free to morph). This creates anchor regions with stable content surrounded by fluid zones where the model interpolates. A wheat field stays a wheat field, but the space between the wheat field and the ocean shimmers through transitional forms: marshes, sandy beaches, golden waves that are ambiguously both water and grain.
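Here is one way the formula and the inverse weight-to-noise relationship could fit together. This is a sketch, not the exact implementation: the linear falloff of the noise and the `max_noise` parameter are assumptions made for illustration.

```python
import numpy as np

def modulate(base_embedding, mask_vectors, mask_weight,
             scale=0.1, max_noise=0.05, rng=None):
    """location_embedding = base_embedding + mask_vector * scale + noise.

    base_embedding: (D,) embedding of the global prompt.
    mask_vectors:   (D, H, W) per-location concept vectors.
    mask_weight:    (H, W) weights in [0, 1]; 1 means fully anchored.
    max_noise is a hypothetical knob: the noise standard deviation falls
    linearly from max_noise at weight 0 to zero at weight 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask_vectors = np.asarray(mask_vectors, dtype=np.float32)
    weight = np.clip(np.asarray(mask_weight, dtype=np.float32), 0.0, 1.0)
    sigma = max_noise * (1.0 - weight)  # (H, W): high weight, low noise
    noise = rng.standard_normal(mask_vectors.shape).astype(np.float32) * sigma
    base = np.asarray(base_embedding, dtype=np.float32)[:, None, None]
    return base + mask_vectors * scale + noise

rng = np.random.default_rng(0)
base = rng.standard_normal(512).astype(np.float32)
vectors = rng.standard_normal((512, 64, 64)).astype(np.float32)
weight = np.zeros((64, 64), dtype=np.float32)
weight[:, :21] = 1.0  # anchor the left third, leave the rest free
out = modulate(base, vectors, weight, rng=rng)
print(out.shape)  # (512, 64, 64)
```

In the anchored columns the output is exactly `base + vectors * 0.1`. Everywhere else it carries the full noise amplitude, which is what lets those regions drift.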
Sinusoidal modulation of the noise parameters over time produces animations where fluid regions breathe and morph while anchored regions hold still.
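A sketch of what such a schedule might look like for a single noise-strength parameter. The parameter names (`base`, `depth`, `frames_per_cycle`) and their default values are illustrative assumptions. The per-frame value would then be scaled by the inverse of the mask weight, so anchored regions stay put.

```python
import math

def sinusoidal_strength(frame, frames_per_cycle, base=0.05, depth=0.5, phase=0.0):
    """Noise strength for a frame: base * (1 + depth * sin(2*pi*t + phase))."""
    t = frame / frames_per_cycle  # position in cycles
    return base * (1.0 + depth * math.sin(2.0 * math.pi * t + phase))

# Two full cycles of 48 frames each.
schedule = [sinusoidal_strength(f, frames_per_cycle=48) for f in range(96)]
print(min(schedule), max(schedule))
```

With `depth` below 1 the strength never reaches zero, so fluid regions keep moving through the whole cycle.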
The Gradio Interface
I built a Gradio interface that lets you paint concept masks directly onto a canvas: select a text concept, choose a brush size, paint regions. Each of the three noise types has its own strength slider, so you can zero out any one of them to isolate its contribution.
The most interesting results come from restraint. Two or three concepts with generous transition zones produce images with internal logic: the model finds visual bridges between concepts that a human compositor wouldn't think of. A wheat field doesn't just abut an ocean; the transition zone fills in with the marshes, beaches, and grain-colored waves described above.
What Didn't Work
Full feature noise, perturbing all 768 dimensions simultaneously, produces visual chaos. The cross-attention features aren't independent; they form a correlated structure, and random perturbation breaks the correlation. The output looks like neural network artifacts, not artistic variation.
Selective feature noise (perturbing only a subset of dimensions that correspond to style properties, found through trial and error with feature ablation) helps, but it's fragile and model-specific. A different Stable Diffusion checkpoint has different feature semantics.
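Mechanically, selective perturbation is simple; the hard part is finding the dimensions. A minimal sketch, assuming a 768x64x64 feature map and a list of channel indices already found by ablation. The indices below are placeholders, not real style dimensions.

```python
import numpy as np

def selective_feature_noise(features, style_dims, strength=0.1, rng=None):
    """Add Gaussian noise to the chosen channels of a (C, H, W) feature map only.

    style_dims should contain unique channel indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=np.float32)
    style_dims = np.asarray(style_dims, dtype=np.intp)
    out = features.copy()
    noise = rng.standard_normal((len(style_dims),) + features.shape[1:])
    out[style_dims] += strength * noise.astype(np.float32)
    return out

rng = np.random.default_rng(0)
features = rng.standard_normal((768, 64, 64)).astype(np.float32)
style_dims = [3, 17, 42]  # placeholder indices, purely illustrative
noisy = selective_feature_noise(features, style_dims, strength=0.1, rng=rng)

# Confirm that only the selected channels were touched.
diff = np.abs(noisy - features).reshape(768, -1).max(axis=1)
changed = np.flatnonzero(diff > 0)
print(changed.tolist())  # [3, 17, 42]
```

Because the index list is tied to one checkpoint's feature semantics, it would have to be rediscovered for every model.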
Temporal variation has limits too. Sinusoidal modulation gives smooth but periodic motion: after one cycle, the animation repeats exactly. Perlin noise in time and random walks with momentum produce more interesting animations but are harder to control and sometimes diverge into incoherence.
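For comparison, a random walk with momentum for one noise parameter might look like the sketch below. The reflection at the bounds is an assumption added here as one way to keep the walk from diverging, and all parameter names and defaults are illustrative.

```python
import random

def momentum_walk(steps, momentum=0.9, step_size=0.01,
                  lo=0.0, hi=0.1, start=0.05, seed=None):
    """Random walk whose velocity carries over between steps.

    momentum near 1 gives long smooth drifts; near 0 gives jittery motion.
    The value is clamped to [lo, hi] and the velocity flipped at the bounds.
    """
    rng = random.Random(seed)
    value, velocity = start, 0.0
    path = []
    for _ in range(steps):
        velocity = momentum * velocity + step_size * rng.gauss(0.0, 1.0)
        value += velocity
        if value < lo or value > hi:
            value = min(max(value, lo), hi)  # clamp back into range
            velocity = -velocity             # bounce off the bound
        path.append(value)
    return path

path = momentum_walk(200, seed=1)
print(len(path), min(path), max(path))
```

Unlike the sinusoid, this never repeats. The cost is the extra tuning: `momentum` and `step_size` interact, and a bad pairing gives either a near-static parameter or one that keeps bouncing off the bounds.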
What I Learned
The regions between known concepts in embedding space aren't empty. They're full of coherent visual ideas the model can render but that no text prompt would naturally describe. Exploring that geometry, with spatial masks as the navigation tool, is a form of generative art that feels distinct from prompt engineering.
The scaling factor controls everything. Three base values, one exponent, and the boundary between "looks intentional" and "looks broken" is paper-thin. Unlimited variation is just noise; the images worth looking at come from a small number of anchored concepts with the model finding the transitions.
The concept mask approach is architecture-agnostic (only the embedding dimensions change) and should work with SDXL or Flux. Real-time interactive generation, painting concepts and watching the image update continuously, would need streaming diffusion on a high-end GPU. The Gradio prototype is partway there.